-
Notifications
You must be signed in to change notification settings - Fork 197
fix: zombie processes during restart #10650
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
fix: zombie processes during restart #10650
Conversation
6186951 to
a32c16f
Compare
|
Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, some relatively minor nitpicks.
2cae2b1 to
b37db6c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
small nits, but it looks good, will approve after green CI
| cfg.Settings.Collector, | ||
| monitor.ComponentMonitoringConfig, | ||
| cfg.Settings.ProcessConfig.StopTimeout, | ||
| 3*time.Second, // this needs to be shorter than 5 * time.Seconds (coordinator.managerShutdownTimeout) otherwise we might end up with defunct processes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
worth to make it a const with a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed in f66af91
| s.log.Warnf("timeout waiting (%s) for the supervised collector to stop, killing it", waitTime.String()) | ||
| // our caller ctx is Done; kill the process just in case | ||
| _ = s.processInfo.Kill() | ||
| case <-time.After(1 * time.Second): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so worst case is waitTime + 1s. please update func docs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed in f66af91
b37db6c to
f66af91
Compare
💛 Build succeeded, but was flaky
Failed CI StepsHistory
|
What does this PR do?
This PR fixes zombie/defunct processes that are left behind when Elastic Agent re-executes itself during restart. The fix involves:
Wait()is calledWhy is it important?
Root Cause
When the Elastic Agent re-executes itself during restart, the following sequence occurs:
execveitselfexecve, all threads other than the calling thread are destroyedPDeathSigmechanism we enable for subprocessesWhy This Affects EDOT More Than Beats
Beats subprocesses typically terminate almost immediately (within the 5-second window), so they don't become zombies. However, the EDOT collector's shutdown time seemed to be affected by:
Impact
This fix ensures proper process cleanup regardless of shutdown duration while maintaining graceful termination when possible.
Checklist
./changelog/fragmentsusing the changelog toolDisruptive User Impact
Users may notice:
How to test this PR locally
Run
TestMetricsMonitoringCorrectBinariesintegration testRelated issues